Introduction

While the debate about whether human intervention has impacted the pace of climate change is all but settled, the search for solutions to averting the most harmful effects of human activity on climate change is just beginning. Research has shown that one of the primary causes of climate change is the burning of fossil fuels, whose rate has been steadily growing. In Prosperity Without Growth, Tim Jackson argues that humankind cannot continue to realize advances in prosperity through the continued consumption of natural resources, pointing out that staying on the current trajectory of resource consumption is an unsustainable endeavor (Jackson 2011). To the point, Jackson urges researchers and policymakers to explore sustainable avenues of global development.

A common trope which is brought up in most discussions of sustainable development is the question of population: many observers claim that population growth is a critical contributing factor to increased energy consumption, arguing that finding ways to curb population growth is a top priority for curtailing the consumption of natural resources and move us closer towards sustainable development.

We are skeptical that this argument is oversimplified and needs further exploration. Is it fair to place the blame for increased energy consumption on developing countries with high population growth, when many Western countries, especially the United States, are consuming far more energy per person? Are population and energy consumption even correlated? If so, does this correlation hold true at a global, regional, and country level? We believe exploratory data analysis is a fantastic technique to help answer these, and other questions regarding sustainable development and population growth. Therefore, the primary goal of this project is to explore and define the characteristics of the relationship between energy use and population over time.

A secondary objective of this project (and our presentation to the class) is to showcase the important impact that Data Science can have on the field of sustainable development. While finance, advertising, marketing, sales, and tech for the sake of tech all have their place and time, we claim that there are a plethora of other fields which could benefit from the insights provided by Data Science tools and techniques.

All team members contributed at various points to data collection, data cleaning, graph creation, and theoretical discussion. We divided responsibility for the interactive component:

For transparency, all of our findings are completely reproducible. All text and code can be seen in the .rmd file on github at https://github.com/edavfinalproject/final-project/blob/master/final_project.Rmd.

Description of Data

Our project uses three datasets below published by the World Bank:

  1. Population, total
  2. Population growth (annual %)
  3. Energy use (kg of oil equivalent per capita)

Each of these datasets contain per country data from 1960 through 2016. The datasets each contain 15,312 observations on eight variables: COUNTRY_NAME, VALUE, YEAR, COUNTRY_CODE, REGION, INCOMEGROUP, INDICATOR_NAME, INDICATOR_CODE. In order to focus the scope of this project and reduce unnecessary complexity, we decided to focus mainly on each indicator value, country and year; even so, we still discuss some of the other metadata features (region and income group) as part of our conclusion in the Future Work section.

The data was gathered and accessed using the World Bank’s API, cleaned, and combined into one tidy dataframe. Making one tidy dataframe was a critical step which facilitated future collaboration within our team, allowing us all to have time to explore the data on our own. Furthermore, having this tidy dataframe helped ensure that the code for graphs would not change as the data format changed. Finally, it allowed us to easily make graphs designed for “tidy” data. This dataset was the linchpin which made this project possible, and thus naturally suggests future work on an R package, which is also discussed later on in the Future Work section.

The code to pull the data and put it in a data frame can be found in the .rmd file on github. We explored two ways to access the data. Our initial method was to identify the URLs that link to the three datasets, then loop through those URLs to download, unzip, and read the data into R for processing. Due to the API’s intermittent downtime, our next process for accessing the data was to use the WDI package. One advantage to using this package is that the data was already in long format. With the former method, we needed to convert the data frames from wide format. As a convenience, the data in CSV format is already stored in the same directory as the Rmd file, so it is not necessary to download the data again.

When processing the data, we only kept the attributes we were interested in exploring, namely country name and code, year, region, and income group. The attribute names were cleaned up for consistency, and attributes were added to identify the indicator type (e.g. “Population growth (annual %)”). Next, the three data frames were unioned together into one tidy data frame. Lastly, we needed to remove rows that represented various collections of countries as identified by the World Bank (e.g. “Small states”, “European Union”). While it would be interesting to see how these bins of countries interacted with the data, we needed to limit the scope of our exploration to geographic region and income group for the sake of brevity.

# Install and/or load required libraries
needed_packages <- c("data.table", "tidyverse", "shiny", "WDI", "extracat")

for (p in needed_packages) {
     if (!(p %in% installed.packages())){
          install.packages(p, repos = "http://cran.us.r-project.org")
     }
    library(p, character.only = TRUE)
}
# Read in files locally
pop_growth_df <- fread("./API_SP.POP.GROW_DS2_en_csv_v2.csv")
new_header <- str_to_upper(gsub(pattern = " ", replacement = "_", x = names(pop_growth_df)))
setnames(pop_growth_df, new_header)

rm(API_SP.POP.GROW_DS2_en_csv_v2.csv)
pop_growth_meta <- fread("./Metadata_Country_API_SP.POP.GROW_DS2_en_csv_v2.csv")
new_header <- str_to_upper(gsub(pattern = " ", replacement = "_", x = names(pop_growth_meta)))
setnames(pop_growth_meta, new_header)
pop_growth_df <- merge(pop_growth_df, subset(pop_growth_meta, select=-c(4,5,6)), by = "COUNTRY_CODE")

pop_df <- fread("./API_SP.POP.TOTL_DS2_en_csv_v2.csv")
new_header <- str_to_upper(gsub(pattern = " ", replacement = "_", x = names(pop_df)))
setnames(pop_df, new_header)
pop_df$V64 <- NULL
pop_meta <- fread("./Metadata_Country_API_SP.POP.TOTL_DS2_en_csv_v2.csv")
new_header <- str_to_upper(gsub(pattern = " ", replacement = "_", x = names(pop_meta)))
setnames(pop_meta, new_header)
pop_df <- merge(pop_df, subset(pop_meta, select=-c(4,5,6)), by = "COUNTRY_CODE")

energy_df <- fread("./API_EG.USE.PCAP.KG.OE_DS2_en_csv_v2.csv")
new_header <- str_to_upper(gsub(pattern = " ", replacement = "_", x = names(energy_df)))
setnames(energy_df, new_header)
energy_df_meta <- fread("./Metadata_Country_API_EG.USE.PCAP.KG.OE_DS2_en_csv_v2.csv")
new_header <- str_to_upper(gsub(pattern = " ", replacement = "_", x = names(energy_df_meta)))
setnames(energy_df_meta, new_header)
energy_df <- merge(energy_df, subset(energy_df_meta, select=-c(4,5,6)), by = "COUNTRY_CODE")
 # Alternate way to download World Bank data via the WDI package

# meta_df <- data.frame(WDI_data$country)
# meta_df <- select(meta_df, iso3c, country, region, income)
# setnames(meta_df, c("COUNTRY_CODE", "COUNTRY_NAME", "REGION", "INCOME"))
# 
# pop_growth_df <- WDI(country = "all", indicator = "SP.POP.GROW", 
#                      start = 1960, end = 2017, extra = TRUE)
# pop_growth_df <- select(pop_growth_df, country, SP.POP.GROW, year, iso3c, region, income)
# setnames(pop_growth_df, c("COUNTRY_NAME", "VALUE", "YEAR", 
#                           "COUNTRY_CODE", "REGION", "INCOMEGROUP"))
# pop_growth_df <- pop_growth_df %>% mutate(INDICATOR_NAME = "Population growth (annual %)") %>%
#     mutate(INDICATOR_CODE = "SP.POP.GROW")
# 
# pop_df <- WDI(country = "all", indicator = "SP.POP.TOTL", 
#                      start = 1960, end = 2017, extra = TRUE)
# pop_df <- select(pop_df, country, SP.POP.TOTL, year, iso3c, region, income)
# setnames(pop_df, c("COUNTRY_NAME", "VALUE", "YEAR", 
#                    "COUNTRY_CODE", "REGION", "INCOMEGROUP"))
# pop_df <- pop_df %>% mutate(INDICATOR_NAME = "Population, total") %>%
#     mutate(INDICATOR_CODE = "SP.POP.TOTL")
# 
# energy_df <- WDI(country = "all", indicator = "EG.USE.PCAP.KG.OE", 
#                      start = 1960, end = 2017, extra = TRUE)
# energy_df <- select(energy_df, country, EG.USE.PCAP.KG.OE, year, iso3c, region, income)
# setnames(energy_df, c("COUNTRY_NAME", "VALUE", "YEAR", 
#                       "COUNTRY_CODE", "REGION", "INCOMEGROUP"))
# energy_df <- energy_df %>% mutate(INDICATOR_NAME = "Energy use (kg of oil equivalent per capita)") %>%
#     mutate(INDICATOR_CODE = "EG.USE.PCAP.KG.OE")
# Manipulate local files
# Tidy each of the three data frames
pop_df <- pop_df %>%
    gather(key = "YEAR", value = "VALUE", -COUNTRY_CODE, -COUNTRY_NAME,
           -INDICATOR_NAME, -INDICATOR_CODE, -REGION, -INCOMEGROUP) %>%
    mutate(YEAR = as.numeric(YEAR)) %>%
    mutate(VALUE = as.numeric(VALUE))

energy_df <- energy_df %>%
    gather(key = "YEAR", value = "VALUE", -COUNTRY_CODE, -COUNTRY_NAME,
           -INDICATOR_NAME, -INDICATOR_CODE, -REGION, -INCOMEGROUP) %>%
    mutate(YEAR = as.numeric(YEAR)) %>%
    mutate(VALUE = as.numeric(VALUE))

pop_growth_df <- pop_growth_df %>%
    gather(key = "YEAR", value = "VALUE", -COUNTRY_CODE, -COUNTRY_NAME,
           -INDICATOR_NAME, -INDICATOR_CODE, -REGION, -INCOMEGROUP) %>%
    mutate(YEAR = as.numeric(YEAR)) %>%
    mutate(VALUE = as.numeric(VALUE))

# CREATE INITIAL COMBINED TIDY DATAFRAME
tidy_pop_energy <- rbind(pop_growth_df, pop_df, energy_df, fill = TRUE)
tidy_pop_energy$V63 <- NULL

# CLEAN UP ATTRIBUTES
tidy_pop_energy <- tidy_pop_energy %>%
     mutate(REGION = str_trim(REGION)) %>% #trailing whitespace in some regions
     mutate(YEAR = as.numeric(YEAR)) %>%
     mutate(VALUE = as.numeric(VALUE))

# LIST OF COUNTRIES WHICH ARE NOT ACTUALLY COUNTRIES
not_country = c("Arab World", "Central Europe and the Baltics", "Early-demographic dividend", "Europe & Central Asia (excluding high income)", "Euro area", "Fragile and conflict affected situations", "Heavily indebted poor countries (HIPC)", "IBRD only","IDA total", "Not classified", "Latin America & Caribbean (excluding high income)", "Latin America & Caribbean", "Low income", "Low & middle income", "Late-demographic dividend", "Middle income", "North America", "Post-demographic dividend", "Small states", "East Asia & Pacific (IDA & IBRD countries)", "Latin America & the Caribbean (IDA & IBRD countries)", "Middle East & North Africa (IDA & IBRD countries)", "South Asia (IDA & IBRD)", "World", "Upper middle income", "Sub-Saharan Africa (IDA & IBRD countries)", "Europe & Central Asia (IDA & IBRD countries)", "Sub-Saharan Africa", "Sub-Saharan Africa (excluding high income)", "Pacific island small states", "Pre-demographic dividend", "Other small states", "OECD members", "Middle East & North Africa (excluding high income)", "Middle East & North Africa", "Lower middle income", "Least developed countries: UN classification", "IDA only", "IDA blend", "IDA & IBRD total", "High income", "European Union", "Europe & Central Asia", "East Asia & Pacific", "East Asia & Pacific (excluding high income)", "Caribbean small states", "South Asia")

# FILTER SO ONLY COUNTRIES ARE IN TIDY FRAME, SORT ALPHABETICALLY
tidy_pop_energy <- filter(tidy_pop_energy, !COUNTRY_NAME %in% not_country)
tidy_pop_energy <- tidy_pop_energy %>% arrange(COUNTRY_NAME)

# ADDING IN ENERGY TOTAL (FOR TAB 1)
# DIVIDE POPULATION BY 1 MILLION FOR SCALING; RENAME INDICATOR
pop_df <- pop_df %>% mutate(VALUE = VALUE/1000000) %>% 
  mutate(INDICATOR_NAME = c("Population (millions)"))
# DIVIDE ENERGY PER CAPITA BY 1000; RENAME INDICATOR
energy_df <- energy_df %>% mutate(VALUE = VALUE/1000) %>%
  mutate(INDICATOR_NAME = c("Energy use per capita (teragrams of oil equivalent)"))
# CREATE ENERGY TOTAL COLUMN
energy_total_df <- energy_df %>% mutate(VALUE = VALUE * pop_df$VALUE) %>% mutate(INDICATOR_NAME = c("Energy use total (teragrams of oil equivalent)"))

# ADD ENERGY TOTAL TO DATAFRAME (FOR TAB 1)
tidy_pop_energy_for_tab_1 <- rbind(pop_growth_df, pop_df, energy_df, energy_total_df)
tidy_pop_energy_for_tab_1 <- tidy_pop_energy_for_tab_1 %>%
  mutate(YEAR = as.numeric(YEAR)) %>%
  mutate(VALUE = as.numeric(VALUE))

# CREATE DATAFRAME OF SUMMARY REGION STATS
tidy_overview <- filter(tidy_pop_energy_for_tab_1, COUNTRY_NAME %in% c("East Asia & Pacific", "Europe & Central Asia", "Latin America & Caribbean", "Middle East & North Africa", "North America", "South Asia", "Sub-Saharan Africa"))
tidy_overview <- tidy_overview %>% arrange(COUNTRY_NAME)

# REMOVE NON-COUNTRIES FROM TAB 1 DATAFRAME
tidy_pop_energy_for_tab_1 <- filter(tidy_pop_energy_for_tab_1, !COUNTRY_NAME %in% not_country)
tidy_pop_energy_for_tab_1 <- tidy_pop_energy_for_tab_1 %>% arrange(COUNTRY_NAME)

Analysis of Data Quality

We represented missing data by using the visna function from the extracat package.

The visna graphs show the population data is nearly complete, but the energy use data has a significant amount of missing data. Since the energy use data was the most incomplete, we broke out the data by income level to see if there are any discrepancies in reporting data based on this attribute.

Starting with low income of Energy Use we see the most common pattern for low-income energy usage is that all yearly observations are missing. This is not the case for the remaining three income brackets. This discrepancy shows that there are gaps in data collection in certain parts of the world, which will bias any findings that use these data.

Figure 3-1. Missing Energy Use (Low Income)

# Energy use low-income
wide_energy_low <- filter(tidy_pop_energy, 
                      INDICATOR_NAME == "Energy use (kg of oil equivalent per capita)" & 
                          INCOMEGROUP == "Low income")
wide_energy_low <- select(wide_energy_low, YEAR, COUNTRY_CODE, VALUE)
wide_energy_low <- wide_energy_low %>% spread(YEAR, VALUE)
wide_energy_low <- subset(wide_energy_low, select = -COUNTRY_CODE )

visna(wide_energy_low, sort = 'r')

Moving to middle income countries we see most countries have data between 1975 and 2010. For our main analysis we filtered for this time range to include as many countries as possible.

Figure 3-2. Missing Energy Use (Lower Middle Income)

#Energy use lower-middle income
wide_energy_low_mid <- filter(tidy_pop_energy, 
                      INDICATOR_NAME == "Energy use (kg of oil equivalent per capita)" & 
                          INCOMEGROUP == "Lower middle income")
wide_energy_low_mid <- select(wide_energy_low_mid, YEAR, COUNTRY_CODE, VALUE)
wide_energy_low_mid <- wide_energy_low_mid %>% spread(YEAR, VALUE)
wide_energy_low_mid <- subset(wide_energy_low_mid, select = -COUNTRY_CODE )


visna(wide_energy_low_mid, sort = 'r')

The upper-middle income countries show a similar pattern to middle income in that most countries have data between 1975 and 2010. This further strengthens our decision of filtering data for this time range.

Figure 3-3. Missing Energy Use (Upper Middle Income)

# Energy use Upper-middle income
wide_energy_up_mid <- filter(tidy_pop_energy, 
                      INDICATOR_NAME == "Energy use (kg of oil equivalent per capita)" & 
                          INCOMEGROUP == "Upper middle income")
wide_energy_up_mid <- select(wide_energy_up_mid, YEAR, COUNTRY_CODE, VALUE)
wide_energy_up_mid <- wide_energy_up_mid %>% spread(YEAR, VALUE)
wide_energy_up_mid <- subset(wide_energy_up_mid, select = -COUNTRY_CODE )

visna(wide_energy_up_mid, sort = 'r')

There are multiple high income countries with a full dataset from 1960 to 2016, yet there are nearly as many countries with no data at all.

Figure 3-4. Missing Energy Use (High Income)

# Energy use high income
wide_energy_up <- filter(tidy_pop_energy, 
                      INDICATOR_NAME == "Energy use (kg of oil equivalent per capita)" & 
                          INCOMEGROUP == "High income")
wide_energy_up <- select(wide_energy_up, YEAR, COUNTRY_CODE, VALUE)
wide_energy_up <- wide_energy_up %>% spread(YEAR, VALUE)
wide_energy_up <- subset(wide_energy_up, select = -COUNTRY_CODE )

visna(wide_energy_up, sort = 'r')

Population data was much more robust. Almost all countries had total population and population growth data from 1960 to 2016.

Figure 3-5. Missing Energy Use (Total population)

#total population
wide_pop_total <- filter(tidy_pop_energy, 
                      INDICATOR_NAME == "Population, total")
wide_pop_total <- select(wide_pop_total, YEAR, COUNTRY_CODE, VALUE)
wide_pop_total <- wide_pop_total %>% spread(YEAR, VALUE)
wide_pop_total <- subset(wide_pop_total, select = -COUNTRY_CODE )

visna(wide_pop_total, sort = 'r')

Figure 3-6. Missing Energy Use (Population growth)

#population growth
wide_pop_growth <- filter(tidy_pop_energy, 
                      INDICATOR_NAME == "Population growth (annual %)")
wide_pop_growth <- select(wide_pop_growth, YEAR, COUNTRY_CODE, VALUE)
wide_pop_growth <- wide_pop_growth %>% spread(YEAR, VALUE)
visna(wide_pop_growth, sort = 'r')

To get a broad sense of the interaction between energy consumption and total population, we created new variable for the concatenation of country and year. This technique creates a unique identifier for each observation. It is difficult to see general patterns initially due to outliers, so we removed these in a second scatterplot.

# Scatter plot of energy use and growth rate by country-year
pop_country_year <- pop_df %>%
    mutate(COUNTRY_YEAR = paste0(COUNTRY_CODE,"_",YEAR))

energy_country_year <- energy_df %>%
    mutate(COUNTRY_YEAR = paste0(COUNTRY_CODE,"_",YEAR))

pop_country_year <- pop_country_year %>% mutate(POP_GROWTH_RATE = VALUE)
energy_country_year <- energy_country_year %>% mutate(ENERGY_USE = VALUE)

pop_energy_year <- 
    left_join(pop_country_year, energy_country_year, by = "COUNTRY_YEAR") %>%
    select(COUNTRY_YEAR, POP_GROWTH_RATE, ENERGY_USE) %>%
    filter(POP_GROWTH_RATE != '' & ENERGY_USE != '') %>%
    mutate(POP_GROWTH_RATE = as.numeric(POP_GROWTH_RATE)) %>%
    mutate(ENERGY_USE = as.numeric(ENERGY_USE))

ggplot(pop_energy_year, aes(POP_GROWTH_RATE, ENERGY_USE)) + 
    geom_point(alpha = .1) + geom_density_2d() + 
     xlab("Population Growth Rate") + ylab("Energy use per-capita (kilograms of oil equivelent)") +
     ggtitle("Figure 3-7. Population Growth Rate vs. Energy Use", subtitle = "One observation = values for each country-year combination")

ggplot(pop_energy_year %>% filter(ENERGY_USE <= 10 & POP_GROWTH_RATE <= 100),
       aes(POP_GROWTH_RATE, ENERGY_USE)) + 
    geom_point(alpha = .1) + geom_density_2d() +
     xlab("Population Growth Rate") + ylab("Energy use per-capita (kilograms of oil equivelent)") +
     ggtitle("Figure 3-8. Population Growth Rate vs. Energy Use - Outliers Removed", subtitle = "One observation = values for each country-year combination")

Main Analysis (Exploratory Data Analysis)

The main analysis section has been split into the following subsections:

  1. Overview - High level representation of the relationship between energy and population over time.
  2. Interactive component - Tools for observing numerous facets of the data.
  3. Findings - Key insights and challenges the team faced.

Overview

According to the International Energy Agency, 13.2% of energy in the world comes from renewable sources. Therefore, in order for clean, renewable energy to constitute 100% of total energy consumption, we would need to increase our current clean-energy capacity by 7.5 times. This figure, however, does not take into account the fact that global energy consumption is consistently growing. As shown in Figure 4-1, global energy consumption has increased approximately 2.5 times from 1975 to 2010, with an annual growth rate of 2.6%. Assuming this constant growth rate, global energy consumption in 2100 would be 8 times the current value and 64 times larger than the current supply of clean, renewable energy sources.

# Reform data
icodes <-  unique(tidy_pop_energy$INDICATOR_CODE)
scatter_data <- tidy_pop_energy[tidy_pop_energy$INDICATOR_CODE == icodes[1],]
scatter_data$pop_growth <- as.numeric(tidy_pop_energy$VALUE[tidy_pop_energy$INDICATOR_CODE == icodes[1]])
scatter_data$pop_tot <- as.numeric(tidy_pop_energy$VALUE[tidy_pop_energy$INDICATOR_CODE == icodes[2]])/1000000
scatter_data$energy_percap <- as.numeric(tidy_pop_energy$VALUE[tidy_pop_energy$INDICATOR_CODE == icodes[3]])
scatter_data$energy_tot <- (scatter_data$energy_percap * scatter_data$pop_tot) / 1000
scatter_data <- scatter_data[complete.cases(scatter_data), ]

# Alter dataset
icodes <- unique(tidy_pop_energy$INDICATOR_CODE)
my_data <- tidy_pop_energy[tidy_pop_energy$INDICATOR_CODE == icodes[2],]
my_data$VALUE <- as.numeric(my_data$VALUE)/1000000
my_data$YEAR <- as.numeric(my_data$YEAR)

# ENERGY GROWTH
energy_agg <- aggregate(energy_tot ~ YEAR, scatter_data, sum)
ggplot(energy_agg, aes(YEAR, energy_tot)) + geom_line() +
  labs(title="Figure 4-1. Energy Growth", 
       x="Year", 
       y="Energy use (teragrams of oil equivelent)") +
  coord_cartesian(xlim=c(1975, 2010)) + 
  scale_y_continuous(labels = scales::comma)

These above-mentioned figures suggest that slowing global energy consumption would do much to help society reach the 100% clean energy mark. As discussed in the introduction, one suggested approach to slowing the growth of energy consumption is slowing the growth of global population. Figure 4-2 shows that the global population has increased linearly from 3.7 billion in 1975 to 6.7 billion in 2010.

# POPULATION GROWTH
pop_agg <- aggregate(pop_tot ~ YEAR, scatter_data, sum)
ggplot(pop_agg, aes(YEAR, pop_tot)) + geom_line() +
  labs(title="Figure 4-2. Population Growth", 
       x="Year", 
       y="Population (millions)") +
  coord_cartesian(xlim=c(1975, 2010)) +
  scale_y_continuous(labels = scales::comma)

Over this same period, however, global per capita energy consumption (Figure 4-3) has growth of 42%. This figure alone suggests that population is not the only major factor in play.

# Per capita
energy_capita_agg <- aggregate(energy_percap ~ YEAR, scatter_data, mean)
energy_capita_agg$weighted <- energy_agg$energy_tot / pop_agg$pop_tot
ggplot(energy_capita_agg, aes(YEAR, weighted)) + geom_line() +
  labs(title="Figure 4-3. Energy per Capita", 
       x="Year", 
       y="Energy use per capita (teragrams of oil equivelent)") +
  coord_cartesian(xlim=c(1975, 2010), ylim=c(0,5))

Yet when we look at figures such as 3-7 and 3-8 we learn that when we split by country the smooth correlation immediately goes away. In Figure 4-4, we examine each of the indicators facted by geographic region, and some interesting trends emerge. First, it is remarkable that energy consumption in East Asia & Pacific increased five-fold over the observed period, became the leading energy consumer in the early 2000s, while over that same period the region’s population merely doubled (from roughly 1 billion to just over 2 billion). Further of note is the fact that, despite this growth in population and energy consumption, East Asia & Pacific is in fourth place in terms of per-capita energy consumption, behind North America, Europe & Central Asia, and Latin America & Carribean.

ggplot(data=tidy_overview, aes(x=YEAR, y=VALUE, col=COUNTRY_NAME, ymin=0)) +
  geom_line() +
  facet_wrap(~INDICATOR_NAME, ncol=1, scale="free") +
  labs(title="Figure 4-4. Indicators by Region", 
       y="", x="Year", color="Region")

In the remainder of the main analysis we provide interactive tools explore the relationship between energy growth and population, and then present our main findings from using the tools we made.

Interactive component

As we started to determine what graphs we should show in this analysis, we quickly realized there were so many insightful ways we could split the data, meaning that the report would either be too long or end up skipping many key insights. Therefore, we chose to make interactive R Shiny apps for the main part of our analysis. The R Shiny apps that we created accomplishes two objectives: not only did it allowed our team to easily explore the data, but it also allows users to explore the data in ways that we might not have thought of, allowing them to draw their own conclusions.

We broke the data into separate tabs and views to explore these distinct aspects of the data. We often found ourselves jumping back and forth between views, and encourage users to do the same, as each story of the data does not need to go in chronological order. The four apps are listed below.

  1. Individual Country - Deep dive into how population and energy has changed over time for one country. This view also allows the viewer to quickly see where missing data exists.
  2. Country Comparison - Presents countries side-by-side in a way where magnitudes of energy and population can be easily compared over time.
  3. Energy vs. Population - Presents energy vs population in a scatter plot to visualize correlations over time. Population can be easily compared over time.
  4. Country Races - Compares the path of energy and population over time for two countries.

The idea to include a year slider was inspired by a TED talk by the late Hans Rosling. In order to make the data widely accessible, we have published these R Shiny apps on the Internet at the following location: https://ds3300.shinyapps.io/world_bank_shiny/.

Findings

In these section we present our findings from our exploration of the data and the interactive tools.

Weak correlation

The steady growth of population and energy lead us to believe there was likely a strong correlation. However, the scatter plots below indicate that although there is a direct relationship, the correlation is not strong. Therefore, there is lots of variation in the relationship between energy and population among countries.

# No zoom
ggplot(data = scatter_data[scatter_data$YEAR == 2010, ], 
       aes(x=pop_tot, y=energy_tot, col=REGION)) +
  geom_point(size=5) + 
  labs(title="Figure 4-5. Energy vs. Population (2010)",
       x="Population (millions)",
       y="Energy use (teragrams of oil equivelent)") + 
  coord_cartesian(xlim=c(0, 1500), ylim=c(0, 3000)) +
  scale_x_continuous(labels = scales::comma) + scale_y_continuous(labels = scales::comma) +
  geom_text(aes(label=COUNTRY_NAME), hjust=-0.2, vjust=0) + labs(fill = "Region")

# zoom
ggplot(data = scatter_data[scatter_data$YEAR == 2010, ], 
       aes(x=pop_tot, y=energy_tot, col=REGION)) +
  geom_point(size=5) + 
  labs(title="Figure 4-6. Energy vs. Population, Zoomed (2010)",
       x="Population (millions)",
       y="Energy use (teragrams of oil equivelent)") + 
  coord_cartesian(xlim=c(0, 75), ylim=c(0, 100)) +
  scale_x_continuous(labels = scales::comma) + scale_y_continuous(labels = scales::comma) +
  geom_text(aes(label=COUNTRY_NAME), hjust=-0.2, vjust=0) + labs(fill = "Region")

Distribution across countires

We immediately notice from Figure 4-7 there is large inequality of energy use. China and USA are the two outliers accounting for nearly 40% of the energy use from those countries alone. The top 10 countries account for 65% and the top 50 account for 93%. This indicates the world’s energy use is highly concentrated.

ggplot(data = scatter_data[scatter_data$YEAR == 2010, ], 
       aes(x = reorder(COUNTRY_CODE, energy_tot), y = energy_tot)) +
  geom_col() +   
  labs(title="Figure 4-7. Distribution of Energy (2010)",
       x="Country",
       y="Energy use (teragrams of oil equivelent)") + 
  scale_y_continuous(labels = scales::comma) +
  theme(axis.text.x=element_blank())

The population has high inequality as well, yet the two outlying countries are now India and China. These two countries account for 38% of the total population. The top 10 account for 61% of the population and top 50 account for 91% of the population. Again, we learn the that although there are hundreds of countries, the overall story described in the Overview section is being driven by a small number of countries.

ggplot(data = scatter_data[scatter_data$YEAR == 2010, ], 
       aes(x = reorder(COUNTRY_CODE, pop_tot), y = pop_tot)) +
  geom_col() +   
  labs(title="Figure 4-8. Distribution of Population (2010)",
       x="Country",
       y="Population (Millions)") + 
  scale_y_continuous(labels = scales::comma) +
  theme(axis.text.x=element_blank())

Different paths

When we look at the three standout countries (United States, China, India) we observe three distinct paths to high energy use. The United States grew its energy use without large increases in population. Conversely, China first grew its population and then later the rate of population growth slowed and the energy use grew. India seems on a similar path as China, with its population growth about to slow and energy use about to rise. Further research into what causes these various paths could help countries design more sustainable trajectories.

# Make plot
ggplot(data = scatter_data[scatter_data$COUNTRY_NAME %in% c("China", "United States", "India"), ], 
       aes(x=pop_tot, y=energy_tot, col=COUNTRY_NAME)) +
  geom_point(size=5) + 
  labs(title="Figure 4-9. Paths of Energy Growth",
       x="Population (millions)",
       y="Energy use (teragrams of oil equivelent)") + 
  coord_cartesian(xlim=c(0, 1500), ylim=c(0, 3000)) +
  scale_x_continuous(labels = scales::comma) + scale_y_continuous(labels = scales::comma) + labs(fill = "Country")

Ethics of change

Countries such as the United Kingdom are seeing the energy use per capita shrink. Conversely, a less developed country such as Indonesia has rapid, steady growth in energy use per capita. This may lead us to believe that United Kingdom is on a “good” path, while the trajectory of Indonesia needs to change. However, the United Kingdom’s current energy use per-capita is three times higher than Indonesia. Hence, is it fair to say they do not deserve to grow their country to the same level? This is one of many ethical dilemmas that we must face as we find the right path to sustainability for each country.

# United Kingdom
ggplot(data = scatter_data[scatter_data$COUNTRY_NAME == "United Kingdom", ], 
       aes(x=YEAR, y=(energy_percap/1000))) +
  geom_line() + 
  labs(title="Figure 4-10. Energy Use Per-Capita (United Kingdom)",
       x="Year",
       y="Energy use per capita (teragrams of oil equivelent)") + 
  scale_y_continuous(labels = scales::comma)

# Indonesia
ggplot(data = scatter_data[scatter_data$COUNTRY_NAME == "Indonesia", ], 
       aes(x=YEAR, y=(energy_percap/1000))) +
  geom_line() + 
  labs(title="Figure 4-11. Energy Use Per-Capita (Indonesia)",
       x="Year",
       y="Energy use per capita (teragrams of oil equivelent)") + 
  scale_y_continuous(labels = scales::comma)

Challanges

We needed to address several challenges over the course of this project, both technical and conceptual. The fact that there are many observations (i.e. nations) in addition to a temporal component made it difficult to display all of the information visually at once. Rather than plotting all observations, we often needed to display the data as binned by regional totals (Fig. 4-4) or global totals (Fig 4.1 to 4.3). In other cases, we chose to display all observations for a given year simultaneously (Fig. 4-6) to show that it is difficult to make any strong causal inferences from the data.

The World Bank has an impressive amount of data freely available for researchers on a large number of topics. After reviewing the data available to us, we quickly realized that we would need to be sparing in the number of indicators we chose to explore. For example, we began to look into other World Bank datasets related to energy, but we decided to limit our investigation. Given that they all are formatted similarily, it would have been very easy to incorporate several additional datasets into our study and become bogged down in the immense number of visualization possibilities.

Executive Summary

Moving towards the 100% clean energy will require immense research into the relationship between energy and population in order for each country to successfully design and follow their paths to sustainability. In this executive summary we present our key findings from exploring this relationship for future researchers to build open as they search for sustainable paths. We also hope this story encourages researches to use the interactive tools above to create their own stories about the relationship between population and energy use. Figure 5-1 and 5-2 show steady growth of both energy and population.

# Energy growth
energy_agg <- aggregate(energy_tot ~ YEAR, scatter_data, sum)
ggplot(energy_agg, aes(YEAR, energy_tot)) + geom_line() +
  labs(title="Figure 5-1. Energy Growth", 
       x="Year", 
       y="Energy use (teragrams of oil equivelent)") +
  coord_cartesian(xlim=c(1975, 2010)) + 
  scale_y_continuous(labels = scales::comma)

# Population growth
pop_agg <- aggregate(pop_tot ~ YEAR, scatter_data, sum)
ggplot(pop_agg, aes(YEAR, pop_tot)) + geom_line() +
  labs(title="Figure 5-2. Population Growth", 
       x="Year", 
       y="Population (millions)") +
  coord_cartesian(xlim=c(1975, 2010)) +
  scale_y_continuous(labels = scales::comma)

However, we see there is a lot of information hidden when averaged over the entire world. When we show each country individually in Figure 5-4 the correlation is much less apparent.

ggplot(data = scatter_data[scatter_data$YEAR == 2010, ], 
       aes(x=pop_tot, y=energy_tot, col=REGION)) +
  geom_point(size=5) + 
  labs(title="Figure 5-3. Energy vs. Population (2010)",
       x="Population (millions)",
       y="Energy use (teragrams of oil equivelent)") + 
  coord_cartesian(xlim=c(0, 1500), ylim=c(0, 3000)) +
  scale_x_continuous(labels = scales::comma) + scale_y_continuous(labels = scales::comma) +
  geom_text(aes(label=COUNTRY_NAME), hjust=-0.2, vjust=0)

From the scatter plot above we immediately see there are outliers, specifically China, India, and the United States. The figure below shows these countries took different path to be outliers.

# Make plot
ggplot(data = scatter_data[scatter_data$COUNTRY_NAME %in% c("China", "United States", "India"), ], 
       aes(x=pop_tot, y=energy_tot, col=COUNTRY_NAME)) +
  geom_point(size=5) + 
  labs(title="Figure 5-4. Paths of Energy Growth",
       x="Population (millions)",
       y="Energy use (teragrams of oil equivelent)") + 
  coord_cartesian(xlim=c(0, 1500), ylim=c(0, 3000)) +
  scale_x_continuous(labels = scales::comma) + scale_y_continuous(labels = scales::comma)

We need to determine the right path for each country, and since each country is starting from a different spot, there is not one path for all countries to follow. Below we see the United Kingdom has decreasing energy use per capita while Indonesia is increasing. Yet looking closely reveals that the United Kingdom has three times the energy use per capita as Indonesia. Hence, should we be trying to shrink the UK to Indonesia? Is it fair to let Indonesia to get to the level of the UK? These are just some of the ethical questions we must face as we find the sustainable path for each country to follow.

# United Kingdom
ggplot(data = scatter_data[scatter_data$COUNTRY_NAME == "United Kingdom", ], 
       aes(x=YEAR, y=energy_percap)) +
  geom_line() + 
  labs(title="Figure 5-5. Energy Use Per-Capita (United Kingdom)",
       x="Year",
       y="Energy use per capita (teragrams of oil equivelent)") + 
  scale_y_continuous(labels = scales::comma)

# Indonesia
ggplot(data = scatter_data[scatter_data$COUNTRY_NAME == "Indonesia", ], 
       aes(x=YEAR, y=energy_percap)) +
  geom_line() + 
  labs(title="Figure 5-6. Energy Use Per-Capita (Indonesia)",
       x="Year",
       y="Energy use per capita (teragrams of oil equivelent)") + 
  scale_y_continuous(labels = scales::comma)

indonesia_uk_data <- scatter_data %>% filter(COUNTRY_NAME %in% c("Indonesia", "United Kingdom"))
ggplot(data=indonesia_uk_data, aes(x=YEAR, y=energy_percap, col=COUNTRY_NAME)) +
  geom_line() +
  labs(title="Figure 5-7. Energy Use Per-Capita (Indonesia vs. United Kingdom)",
       x="Year",
       y="Energy use per capita (teragrams of oil equivalent)") +
  scale_y_continuous(labels=scales::comma) +
  coord_cartesian(ylim=c(0, max(indonesia_uk_data$energy_percap)))

Conclusion

Our biggest lesson learned was that our hypothesis – that the strong direct relationship between energy and population growth was not the full story – was correct. When broken down by country, the correlation was simply less visible. We hope this inspires others, as our team has been, to think deeply about how population and energy factor into our paths to sustainability.

Our team was inspired by depth of what could be explored with just the variable energy use, population, year, and country. Nonetheless there are numerous other factors in the relationship between population and energy use. These include economic, health, wealth, and many more. Including these factors in future analyses would be beneficial. Yet the numerous insights is also a challenge. There are many figures that can be made that are not insightful or can be misleading.

Our group quickly realized the tidy dataframe could inspire collaboration beyond our team. Future work could involve organizing all the data sources into an organized database with APIs for R, Python, and other languages for researches to use. Furthermore, we could create an online environment for researchers to share their explorations of the data. We believe this collaboration could greatly enhance countries to successfully define their unique paths to sustainability.

References

Jackson, Tim. 2011. Prosperity Without Growth: Economics for a Finite Planet. Paperback edition. London: Routledge.